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Abstract. We apply the holonomic gradient descent recently introduced in 
[TO] to the maximum likelihood estimate (MLE) of the Fisher-Bingham dis- 
tribution on the <i-dimensional sphere. We derive a Pfafhan system (an inte- 
grable connection) and a series expansion associated to the normalizing con- 
stant. These enable us to solve some MLE problems up to dimension d = 7 
with accuracy. 
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1 Introduction 

Let x = (xij) and y = (j/j) be a matrix parameter of size x (d + 1) such 

that x^ = Xji for i ^ j and a vector parameter of size d+1 respectively. We 
are interested in the Fisher-Bingham probability distribution 

if d+i \ 

H(x,y,r;t)\dt\ := -^T ^ exp x ijUtj + ^ViU J \dt\ (1) 

on the (i-dimensional sphere S d (r) = {(ti, . . . , i<j+i) I Si=i = r 2 ,r > 0} and 
the maximum likelihood estimate with respect to this probability distribution. 
Here, the function Z is the normalizing constant defined as 



Z(x,y,r)= exp ^ xgUtj + ^ ViU \dt\ (2) 

Js "( r ) \l<Kj<i+l i=l J 
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and \dt\ denotes the standard measure on the sphere. 

The maximum likelihood estimate (MLE) is nothing but the problem of 
finding the maximum of the function in x, y 

fc=i 

for given data vectors T^ k \ k = I, . . . , N in the i-space. In order to compute 
the MLE, we need approximate values of the normalizing constant Z and its 
derivatives. In case of d = 2, the normalizing constant is expressed in terms 
of the Bessel function and there are several approaches for the MLE in the 
directional statistics [3], [5], [T3]. However, there are a few studies on approx- 
imations of the normalizing constant for the case of d > 2 and applications to 
the MLE. Among them, Kumc and Wood [7] proposed a method to evaluate the 
normalizing constant by utilizing the Laplace approximation of the integral for 
d > 2 and Kumc and Walker [5] gave a series approximation of the normalizing 
constant. 

In this paper, we propose a different method to evaluate it and present ap- 
plications to the MLE. Our method is the holonomic gradient descent (HGD) 
proposed in |10j , which utilizes a holonomic system of linear differential equa- 
tions satisfied by the normalizing constant and gives the MLE with accuracy. 
The HGD consists of 4 steps. The first step is to derive a holonomic system 
of linear partial differential equations for the normalizing constant. The second 
step is to translate the holonomic system into a Pfaffian system, which is roughly 
speaking a set of ordinary differential equations with respect to the variables 
Xij and yi for the normalizing constant. These two steps can be performed by a 
symbolic computation (the Grobner basis method) if the size of the problem is 
moderate. The rest steps utilize numerical analysis. The third step is to evalu- 
ate the normalizing constant and its derivatives at an initial point. To do this, 
we can use a numerical integration for a rough evaluation or a series expansion 
for an accurate evaluation. The last step is to extend the evaluated values to 
other points needed for the MLE by the Pfaffian system and a numerical solver 
of ordinary differential equations. 

It is shown in [5] and |10j that the normalizing constant of the Fisher- 
Bingham distribution is a holonomic function in x, y, r and consequently it is 
annihilated by a holonomic ideal of which explicit expression is given. In order 
to use the HGD, we need to translate the ideal into a Pfaffian system, which 
can be done by a Grobner basis computation (see, e.g., the appendix of |10j). 
It is done on a computer for d < 2 in [TU], however, it cannot be done for d > 2 
on current computers and Grobner basis algorithms due to high computational 
complexity. 

In this paper, we derive the Pfaffian system for the general d by hand and 
derive series expansion of the normalizing constant. We will present that the 
HGD works well up to d = 7 with accuracy for some class of problems by 
utilizing them. 
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2 The Pfaffian System for the Normalizing Con- 
stant 



It is shown in [TU] and [5] that the normalizing constant Z @ of the Fisher- 
Bingham distribution is a holonomic function in x, y, r and consequently it is 
annihilated by a holonomic ideal i\ The holonomic ideal / is generated by the 
following operators in the ring of differential operators. 

dij-didj (l<i<j<d+l), (3) 

d+1 

£df-r 2 , (4) 

1=1 

Xijdl + 2(xjj - x li )d l d j - Xijd'j + £ (x kj d l d k - x lk djd k ) 

l<k<d+l,k^i,j 

+y j d i -y i d j (l<i<j <d + l), (5) 

d+1 

rd r - 2 £ Xijd t dj - £y;<9; - d (6) 

l<i<j<d+l i=l 

Here, dij stands for g§-, 9» for and 9 r for Note that we assume 

Xij ^ji ■ 

In order to apply the HGD for the MLE of the Fisher-Bingham distribution, 
we need an explicit expression of the Pfaffian system associated to the holonomic 
ideal I for an input to a numerical solver. Let us review the definition of the 
Pfaffian system (see, e.g., [10| as to details). Let rank(J) be the holonomic rank 
of the ideal / in the ring of differential operators and F a vector of the standard 
monomials of a Grobner basis of /. The length of this vector is rank(J). We 
denote each clement of F by d a , a G S. We suppose that the first element of F 
is d° = 1. The Pfaffian system is a set of differential operators 

dij — Hij, d k — H k , d r — H r 

where Hij, H k , H r are rank(J) x rank(J) matrices with rational function entries 
which satisfy 

VoV = 0, V = - £ Hijdxij - ^2 H kdy k - H r dr 

and annihilate the vector valued function F(Z) = {d a Z \ a £ S) T . The symbol 
d in the definition of V is not the dimension of the sphere and denotes the 
exterior derivative with respect to the variables x^, y k , r. In some literatures, 
the definition of the Pfaffian system does not include the integrability condition 
V o V = 0, but we shortly call the integrable Pfaffian system of equations the 
Pfaffian system. Our Pfaffian system is nothing but the integrable connection 
V. 

The Pfaffian system can be obtained by an algorithmic method explained, 
e.g., in |10j and the computation for the Fisher-Bingham distribution was done 
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on a computer up to d = 2. In this section, we give the PfafHan system for the 
general dimension. 

Before going to the discussion in the e?-dimcnsional case, we illustrate our 
method in the case of d — 1 and r = 1. Let I\ be the left ideal generated by 

dn-df, 8,2-8,82, 822 -81, (7) 

df + di-i, (8) 

x 12 8i + 2(x 2 2 - xxi)d\d2 - x\%d\ + y 2 di - yid 2 (9) 

in the ring of differential operators. The holonomic rank of I± is 4. Let F be 
a vector of operators (1, d\, 82, 8f) T . We want to find a matrix P of which 
entries are rational functions such that 8\F = PF holds modulo the left ideal 
I\. Here, s = t means that each element of s — t belongs to I\. Since 8\F = 
(8\, 81,8182, 8 i) T , we need to express 8182 and 8f in terms of F modulo I±. 
Eliminating 8\ from © by ©, we obtain 

2(^22 - x\\)8\d2 = -x 12 8\ + x 12 8l - y 2 8i + yid 2 

= -x 12 8f + 2:12(1 - df) - y 2 8i + yid 2 

(the underlined term is reduced by (JHJ) 
= (x 12 ,-y2,yi,-2x 12 )F (10) 

Thus, we have expressed 8182 in terms of F. From 8\ x ([§]), we obtain 

X12&1 + 2(x 2 2 - x 11 )8fd2 = xi 2 didl_ - y2&f + y 1 8182 + 82 
= xMl - Of) - y 2 df + -( Xl2 - y 2 8 1 + yi 8 2 - 2x 12 8 2 1 ) + 8 2 

2(X 2 2 - Xu) 



(the underlined terms are reduced by (jH)) and (|10p). 
and consequently we have 2xi2df + 2(x 2 2 — x\\)8\8"2, = (a, b, c, d)F, where 



V1X12 L 2/12/2 

X12 



2(X22-XnY 2(x 2 2 - X\i) ' 

I + 777 7> d =-y 2 



2{X13—Xxi) X22-XU 

By a similar computation for 82 x we have 

2(x 22 - x lx )8l - 2x 12 dfd2 = (a 1 , b' , c', d')F, 



where 



^122/2 1 , „, > 2/2 

-2/1 + 777 7> 6=1 + 2(x 22 - 



/ 



2(x22-2:il)' 2(X22 - Xu) 1 

2/12/2 2:122/2 

777 7: d =2/1 • 

2(x 2 2 - am) x 22 - am 
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Therefore, we obtain 

2xi2 2(x 2 2-xn)\ ( d\ \ = fa b c d\ 
2(x 22 - Xll ) -2x l2 )\dld 2 )-\a' b' d d'J*- 

„ ,,. i • ,l . i • ( 2xi2 2{x 22 -x\\)\ 1 

Multiplying the inverse matrix _ , . „ , we can express 

y B \2{x 22 -x n ) -2xi 2 J 

df in terms of F. Thus we obtain the explicit expression of d\F = PF. The 

identity d\F = PF gives a Pfaffian equation for the direction y±. In other 

words, the differential equation 

dF(Z) _ DE1/ ^ Vtr 7 \_ f r 7 r 7 ry ry , T 



dyi 



PF(Z), F(Z) = (Z,Z yi Z y2 ,Z yiyi y 



holds. This is an ordinary differential equation for the vector valued function 
F(Z) with respect to the variable y±. Ordinary differential equations for the 
other directions can be obtained analogously. 

Let us go to discussions for the general d. Let F be a vector of operators 

{i,d 1 ,...,d d+l ,dl,...,dl) T . (11) 

We denote by fa the i-th element of the vector F. We define two auxiliary 
vectors of operators to present the expression. We sort the set of the square free 
second order operators 

{didj\l<i<j<d + i} 

by the lexicographic order. This gives a vector of operators of the length d(d + 
l)/2: 

F& =(d 1 d 2 ,d 1 d 3 ,...,d d d d+1 ) T . (12) 
We sort the set of the third order operators 

{didjd k \l <i<j <k<d + l, j < d} 

by the lexicographic order. We denote by F^ the sorted vector 

f^ = {drdxdu d 1 d 1 d 2 ,...,d 1 d 1 d d+1 , d 1 d 2 d 2 ,...,d d d d d d+1 ) T . (13) 

The length of this vector is denoted by m — d(d + l)(d + 5)/6. 

When two operators l\ and i 2 are the same modulo the ideal /, we denote 
it by l\ = i 2 . By examining the proof of Lemmas 2 and 3 of [10], we obtain 
the following two lemmas which give an expression of the second and the third 
order operators F^ and F^ in terms of F. 

Lemma 1 We have 

p(2)_p(2) + Q(2)p = q ( 14 ) 
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Here, P^ is an invertible d(d + l)/2 x d(d + l)/2 matrix and is a d(d + 
l)/2 x (2d + 2) matrix of which entries are as follows. 



P- 



(2) 



ij,kl 



2(xjj Xii) (i — k,j — /) 
Xji (i = k,jjtl) 

Xjk (i = l,j^k) 

-x ik (i^k,j = l) 

^-xu (i^l,j = k) 

VjSk,i+l ~ Vifaj+l + Xi]$k,i+d+2 - Xijh,j+d+2 (j < d) 

Vjh,i+i - Vihj+i + Xij5 k ,i+d+2 - r 2 Xi, d +iSki + J2e=i %i,d+ih,e+d+2 (j = d+ 1) 



Here, S is Kronecker's 6 and P^ kl is the matrix element of P^ standing for 
didj and d k di in F<- 2 \ We use this notation of the index of the matrix element 
in the sequel. 



Lemma 2 We have 



p(3) F (3) + Q(3) F (2) + #(3) F = . 



(15) 



Here, P^ , and are an invertible m x m matrix, an m x c?(d + l)/2 

matrix and an m x (2d + 2) matrix of polynomial entries respectively. Entries 
are defined as follows. 



p(3) 
ijk,abc 



o (3) 

ijk.ab 



R 



(3) 
ijk,a 



where 



+2(x kk - Xjj)S ai S b jS ck + (£fe,d+l - l^sAAsAfc 

+ J2ljtj,k { X kl S 'abc;tjl ~ X 3l 5 'abc-m + X jkSk,d+lS' abc . ia 
XijS a iSbiS c j ~\~ 2{xjj •Eii)o~aiO'bjO'cj Xij6 a j6bjS c j 

+ f x iI^o6c;iji — X H^'abc;jjl S j 

Y?s=l( X s-d+lS'abc;is,d+l ~ 2 i X d+l,d+l ~ x ii)^abc;iss 

+ Ei^i X H^'abc:lss) 

(1 - 'I',, )///, '■>',., 4, - yjSaiSbk (i < j < k < d + 1) 
y.j8 ai 8bj (i<j = k<d+l) 

Vd+iSaiSbM+i (i=j = k<d+l) 

-Xjkr 2 6k,d+lS a ,i+l - Stj5a,k+1 + VkStj5a,i+d+2 
— yi0~a.j+d+2 + 0~a,i+l 

-yir 2 S a i + (2(xd+x,d+i - xu)r 2 + l)5 a ,i+i 

~ Xiir 2 5 a j+1 + J2l<d+1 Vi^a.l+d+2 



(i<j<k<d+l) 



(i < j = k <d 
(t = j = k < d 



1) 
1) 



(i < j < k < d + 1) 
(i < j = k < d + 1) 

(i = j = k < d + 1) 



6' 



abc;ijk 



1 {d a d b d c = didjd k ) 
[dadbdc ^ didjdk) 
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Proof. We denote by the differential operator ([5]) in /; we put 

= x tj df + 2(xjj - xu)didj - Xijdj + 2J {xk J d l d k - x ik djd k ) + yjd l - y 4 <9j. 

Define a differential operator Gijk (i < j ' < k < d + 1, j ' < d) by 

fftCj-fc (i < j < k < d + 1), 

Gijk = I djCij (i < j = k < d), 

{d d+1 Ci,d+i (i = j = k < d). 

We expand Gijk in the ring of differential operators and express it in terms of 
the elements of F, F 1 - 2 ' , and F^ 3 \ For example, when i < j < k < d + 1, we 
have 

Gijk = diCjk 

= d t (xj k dj + 2(x k k - Xjj)djd k - x jk dl 

+ ^2 (xikdjdi ~ Xjidkdi) + y k d.j - yjd k ) 
i^j.k 

= Xjkdid 2 + 2(x k k - x j j)d l djdk - Xjkdid\ 

+ ^2 (xkidid 3 di ~ Xjididkdi) + y k d l d j - yjdid k 
ijtj,k 

which yields P^jj , P-jLjk . ■ • • . Qm,ij > Qm,ik ■ Analogous expansions and rewrit- 
ings for the other cases give the conclusion. Q.E.D. 

We denote by Mat(fc, I, S) the space of the k x I matrices with entries in the 
set S. 

Lemma 3 The vector F satisfies the identity 

AdiF = BF + CF {2) + EF^ . (16) 

Here, A = (a m ) e Mat(2d + 2, 2d + 2, C[x, y, r]), B = (b pj ) G Mat(2d + 2, 2d + 
2,C[x,y,r]),C = (c pjfe ) e Mat(2d + 2, d(d + l)/2, C[x, y,r\), E = (e p , jM ) e 
Mat(2cZ+ 2,m,C[x,y,r]) and A is invertible in the space of the matrices with 
entries in C(x,y,r). Explicit expressions of these matrices are given in J17| ), 

HW, (M, W), WM, (M>, tM>> ©• Note that 4 C , E depend on the 
index i. 

Notation: c p jk means the element at the p-th row of C and the column of C 
standing for djdk = dkdj. e p jk£ is defined analogously. 

Proof. The both sides of (|16[) is a column vector of the length 2d + 2. We 
will determine the rows of A, B, C, E from generators of /. Note that the index 
i is fixed over the proof. 
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The first rows. The first element of the vector diF is di, then we have 

an = 1, = 1 (17) 

and the other elements of the first rows of A, B, C, E are 0. 

The (j + l)-th rows (1 < j < d, i ^ j). Using the differential operator ([5]) in 
J, we have 

a:^ 2 + 2(xjj - x^didj + ^2 -''I. ,'),')•. = Xijd? + ^ x ik d 3 d k + y t d-j - y 3 di. 

k^i,j kjtij 

Therefore, we may put as 

Oj'+M+l = a j + l-j + l = 2\ x jj ~ x ii)i (18) 

aj+i,k+i = x kj (l<k<d+l,k^i,k^ j), 

— Vil &J+l,t+l = — &j + lj+d+2 = Xy, 

Cj+ijfc = cc, ;fe (1 < < d + 1, fc ^ i, fc 7^ j). 

The other elements of the (j + l)-th rows of A, B, C, E are 0. 

The (i + l)-th rows. The (i + l)-th element of the vector diF is df. When 
% < d, we put 

Cti+l,i+l = 1, &i+l,i+d+2 = 1 (19) 

and the other elements of the (i + l)-th rows are 0. When i = d+ 1, we consider 
the operator ((4|) in the ideal /. Then, we have 



d 

,2 
k 

k=l 



and hence we put 

ad+2 : d+2 = 1, (20) 
bd+2.1 = r 2 , b d+2 ,k+d+2 = -1 (1 < k < d). 

The other elements of the (i + l)-th rows of A, B, C, E are 0. 

The (d+2)-th rows. When i = d+l, it is reduced to the case of the (i + l)-th 
rows. We assume that i < d. Using the operators ((SJ) and ^ in /, we have 

Xt,d+idf + 2(x d +i.d+i - xu)did d+ i + ^2 x k ,d+idtdk 

k^i,d+l 

= x hd +idj +1 + ^ Xikdd+idk + Vidd+i - Vd+idi 

k=ii,d+l 

d 

= x l . d +ir 2 - ^x^d+idl + ^ x ikdd+idk + Vidd+i - Vd+idi- 

k=l k=£i,d+l 



Hence, we put 

&d+2,i+l = £»,d+l) Cld+2,d+2 = 2(aJd_|_i 5 d_|_i — Xu), (21) 

a<i+2,fc+i = (1 < k < d, k ^ i), 

bd+2A = Xi,d+lT 2 , bd+2,d+2 = Vh bd+2,i+l = — J/d+1) 

^d+2,;+d+2 = — a5»,d+i, (1 < I < d), 
Cd+2Md+i) = x ik {^<k<d,k^i). 

The other elements of the (d + 2)-th rows of A, B, C, E are 0. 

The (j + d + 2)-th rows (1 < j < d, i ^ j). Using the operator ([5]) multiplied 
by 9j from the left hand side, we have 

-2{xjj-xn)did] = Xijdfdj-Xijdj+ (.r h <),<),i),, - Xikd^d^+yjdidj-yidj+di. 

When i < d, we put 

a.j+d+2j+d+2 = — xu), (22) 

bj+d+2,i+l = 1, Oj+d+2,j+d+2 = ~2Aj 
e j+d+2,iij = ^-ij) e j+d+2,jjj = ~ x ijj 

e j+d+ 2,ijk = Xkj, e j+d +2,jjk = -Xik (I < k < d+l,k ^ i,k ^ j). 

The other elements in the ( j + d + 2)-th rows of A, B, C, E are 0. 
When i = d + 1, We use the operator §4§ anc j obtain 

-2(xjj - x d +i,d+i)d d +idj 
= Xd+x,jr 2 dj - 2xd+\ t jdij + yjdd+idj - yd+id 2 + d d +i 
+ 2J (xkjdd+idjdk - x d +i,kdjd k - x d+ \,jdjdl) . 

kjtd+l,j 

Therefore, we may put as 

a j+d+2,j+d+2 = —^{Xjj ~ Xd+l,d+l)i (23) 
bj+d+2,j+l = %d+l,j r ) ^j+d+2,d+2 = 1, i>?'+d+2,j+d+2 = — 2/d+l; 
Cj+d+2J(d+l) = Vj, 
e j+d+2,jjj = — 2Xjj, 

ej+d+2,ijfc = Ejy, e i+£i+ 2,iifc = -a:*, e J+d +2jkk = (1 < k < d, k =^ j). 

The other elements of the (j + d + 2)-th rows of A, B, C, E are 0. 

The (i + d + 2)-th rows. We may assume that i < d. Since the (i + d + 2)-th 
element of the vector diF is df, we put 

0>i+d+2,i+d+2 = 1, ei_|_d+2,nj = 1- (24) 

The other elements of the (i + d + 2)-th rows of A, B, C, E arc 0. Q.E.D. 
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From the Lemmas [TJ [2J [31 we have the following theorem, which gives a 
differential equation satisfied by the normalizing constant with respect to the 
variable j/j. As we remarked in the Lemma [3J we note that A,B,C,E depend 
on the index i and we omit to denote the dependency. 

Theorem 1 Put 

Hi = A- 1 (B - CipWyiQW + E(P^)- 1 (q( 3 )(P< 2 ))- 1 Q( 2 ) - . 

Then, we we have diF = H^F . 
Proof . 

diF = A~ 1 {BF + CF {2) +EF {5) ) by the Lemma 1 

= A- 1 (BF + CF& - E(P^)- 1 (q {3) F^ + i?(3)F)) 
by the Lemma [2] 

= A- 1 (BF - C{P^y l Q^F - E(P^)- 1 (-Q^iP^y^F + R {3) F 
by the Lemma [1] 

= A- 1 (B - CipWytQW + E(P^)- 1 (Q( 3 )(P( 2 ))- 1 Q( 2 ) - i?( 3 ))) F 
Q.E.D. 

In the case of d = 1 and for the y\ direction, these matrices are as follows. 



F = (1 d 1 d 2 d 2 ] 



A = 



C 



(\ 

1 

x 12 

\0 

(°\ 




w 



E = 






-2x u + 


/o 0\ 



\i oy 



0^ 




( 


1 





^ 





, B = 











l 





r 2 xi2 


-2/2 


2/1 


-Xl2 


l J 




\ 








o J 



P (2) = (-2xii + 2x 22 ) , = {-r 2 x 12 y 2 -yi 2x 12 ) , 



P( 3 ) = 
= 



2xu - 2x 22 

2X12 



2X12 

-2xn + 2x 2 2 



)(3) _ 



yi 

-yi 



-r 2 yi -2r 2 xn + 2x 2 2r 2 + 1 
-?' 2 xi2 



-r 2 xi2 y\ 

-1 2/2 
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The relation between dijF and F can be easily obtained by (flB)) . In fact, 
we have dijF = didjF = djdiF by (J3j) , then we have 

d t3 F = djdiF = djiHtF) = -^-F + HtidjF) = + HiHjj F. (25) 

Note that the matrices A,P( 2 \P^) depend only on x and r. Then, ^-p- has a 

relatively simple expression in terms of the matrices A, B, C, P, PW j QM, pM. 
We denote by Hij the matrix + HiHj. The matrix such that d r F = H r F 
can be obtained easily by utilizing Thus, we have obtained the relations 

diF = HiF, dijF = HijF, d r F = H r F, (26) 

which will be used for the holonomic gradient method and descent. 

In [5J, we prove that the holonomic rank of I is equal to 2d+ 2. Therefore, 
the Pfafhan equations are expressed in terms of (2d + 2) x (2d + 2) matrices. 
Our matrices obtained are nothing but these. The integrability conditions of 
Pfafhan equations imply ^ L + HiHj = + HjHi. 

In [TU] , the differential equation satisfied by the likelihood function for d = 1 
and d = 2 are derived by a heavy Grobner basis computation and we could 
not obtain them for d > 3. By virtue of Theorem [TJ we can describe the 
differential equation satisfied by the likelihood function with relatively small 
sized matrices with polynomial entries and the inverses A- 1 and (P^)- 1 for 
the general dimension. If we calculate these inverse matrices by a symbolic 
computation, we obtain the same result with the Grobner basis method. In order 
to apply for the HGD, we do not need to calculate these inverse matrices with 
polynomial entries symbolically, but we need only inverse matrices numerically 
when variables are specialized to real numbers in each step of the Rungc-Kutta 
method. This will become a key ingredient of our algorithm, which will be 
discussed in the section [5l 



3 Series Expansion for the Normalizing Con- 
stant 

Let us define the function Z by the integral 

Z(x,y,f)= [ expf y2(£iti +yiti) J \dt\. (27) 

The function satisfies the invariance relation 

Z(5,y,l) = Z(r- 2 5,r- 1 y,r). (28) 

This function is nothing but the restriction of the normalizing constant Z to 
the diagonalized x. Since the normalizing constant is invariant under the action 
of the orthogonal group O(d), we can express F(Z) in terms of F(Z). The 
following proposition can be obtained by a straight forward calculation. 
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Proposition 1 We suppose that the real symmetric matrix x is diagonalized by 
an orthogonal matrix P = (pij)- Put x = P T xP, y = Py, f = r. Then, we have 



Z(x,y,r) = Z(x,y,r) 

dZ ™ dZ 

— {x,y,r) = >p lk ^-{x,y 7 r) 

% £ri dy k 

d 2 Z, . £i a 8 2 Z^ . 
-^■(x,y,r) = ^ Pik -^.{x,y,r) 

PikPu ^ ( dZ dZ \ 

Wc will give a series expansion of the normalizing constant and an estimate of 
the truncation error. We note that Kume and Walker give a series approximation 
of the Fisher-Bingham distribution and consequently that of the normalizing 
constant [8]. Our expansion contains the parameter r and is different from 
theirs and is fit to the holonomic gradient method discussed in the next section. 
We will omit ~ for x and y in the sequel if no confusion arises. 

Theorem 2 1. The restricted normalizing constant has the following series 
expansion 

Z(xyr)-S rl - V r ^l^l ( rf - 1 ) !! nt + i(2^ + 2ft-l)!! 2g 

(29) 

Here, Sd = Jgd^ \dt\ denotes the surface area of the d-sphere of radius 
I, and No = {0,1,2,...}. For a multi-index a € Ng +1 , we put a\ = 
Y[f=i cti ] -, ot\\ = Y[itl a i", and \a\ = X)i=i a i- 
2. The truncation error of the series is estimated as 

V rd+ 2\ a+0 l (d 1)!! I&M + 2ft - 1)!! 20 
^> N (d-l + 2|a| + 2|/J|)!!a!(20)! V 

r d ( \ N J\T i 1 

We note that the series (|2"5|) converges slowly when r 2 ^Zid 2 -* I + \Vi\ 2 ) > 1 
and converges relatively rapidly when r 2 ^j(|xi| + |yi| 2 ) < 1. The derivatives 
of Z are expressed as derivatives of the right hand side of ([29]). 

Proof . Put g(x,y,t) = X^=i( :E ^ 2 + ViU)- Then, the integrand is expanded 

as 

oo ^ 

cxp( 3 (x, y, t)) = V — rflr(ar, y, i)" (31) 

n=0 
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which is uniformly convergent on any bounded set. Therefore, the Fisher- 
Bingham integral is expanded as 



oo „ 

Z{x lV ,r) = Y, -.g{x lV ,t) n \dt\. (32) 



n=0 J S d (r) 

We redefine g and define h as 



d+1 d+1 

g = ^2x l tf, h^^yrfi. (33) 

i=l t=l 

Applying a well-known expansion 

(xi + ■ ■ ■ + XvT = = E r *? 1 ' ' ■ ^ . (34) 

a\ ai'.-'-aJ. F 

\a\=n QlH ha p =n r 

we obtain the n-th term of the series expansion of the Fisher-Bingham inte- 
gral 

-Ag + h) n \dt\ 

S d (r) n - 

- £ / si(r) (e a*"**) ■ (e w 

fc+2C=n 6 W \\a\=k ) \\0\=i V ' / 

- E EE^/ S ,/ M "w 

fc+2^=n|a|=fc|,S|=£ v ' ' Jb (r > 

9 V V V r 2(fc + ^ ^ (d-l)!!n^i(2"« + 2A-l)!! 
d a!(2/3)! (d-l + 2|a + fl|)l! 

The last rewriting is obtained by integrating monomials on the sphere. See, 
e -g-i DP- Thus, we have proved the first claim. 
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Let us show the second claim. We have the following estimates 



E 



< 



< 



< 



\a+/3\>N 

E 

\a+p\>N 

E 

\a+0\>N 

E 

\a+@\>N 



d +2 | q+g | (rf ~ 1)" nf=i + 2A - 1)! 

(d-l + 2|a|+2|0|)!!a!(20)! 



a. ,20 



x y 



„ d+ 2\ a+P \ (^i)!'nt + i(2»» + 2ft - m p 



(d-l + 2|a|+2|0|)!!a!(20)! 



-x y 



„d+2\a+P\_ 



1 



a!(2/3)! 



1 



x a y 20 



,d+2|a+/3|_|_ a 2/3 

a!/3! ^ 



,.2jj 



n>JV \ a +j3\=n r 



< r 



r 2n fd+1 \" , / d+1 

d E^r £(w + i^ 2 d ^ E h K E(N + itfi) 



1>W 



1=1 



Put L(x, y, r) = r 2 J2i=i (\ x i\ + \vf\)- When N is sufficiently large, we have the 
estimate 



n>N 

by the estimate 



1 r d 

V" L(x,y,r) n < —L(x,y,r) N 



7V + 1 



+ 1 - L(x,y,r) 



(35) 



E -V 



i>N 



V! ^ (JV+l)n- 
1 °° 1 

AH <WiV+l)„ 

71=0 V 



-JV 



A' 



< 



1 °° 1 

AH ^ (iV + 1)" 

n=0 V ; 



AH iV + l-L 



Q.E.D. 



4 Numerical Evaluation of Normalizing Constant 

In order to efficiently evaluate Z numerically, we use the method introduced 
in [TT], which they call the holonomic gradient method (HGM). The HGM is 
a method to evaluate the normalizing constant by utilizing a system of differ- 
ential equations. In the case of the Fisher-Bingham distribution, we evaluate 
numerically the series ([2]) in the domain r 2 X^G 3 -*! + \Ui\ 2 ) ^ 1 an( i extend the 
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numerical evaluation out of this domain by a differential equation with respect 
to r. To use this method, we prepare the following Theorem. 

Theorem 3 1. The function Z is annihilated by the left ideal I generated by 

Ai = dl-d Xi (1 <*< d+1), 
B = d 2 H h d 2 -r 2 

Cij = 2(xi - Xj)d yi d y . + yid V] - y 3 d Vi (1 < i < j < d + 1), 

d+1 d+1 

E = rd r - 2 X K ~ £ y> d v> - d - 

i=l i=l 

2. PutF = (d Vl ,...,dy d+1 ,d 2 1 ,... 7 d 2 d+1 ) T . Then, we have d r F = pMp modi . 
Here, the matrix P^ = (pjj^) is defined by 

d+1 

rpj 5 = (2x t r 2 + 1)% + ^ yi8 j(k+d+1) (1 < * < d + 1), 

fe=i 

r P([+d+i)i = 2/i r2<5 u + ( 2a; * r2 + 2 )%+d+i) + £ ^(fe+rf+i) (! < * < d + !) 

/or 1 < j < 2o!+ 2. 
The proof of this theorem is analogous to that for the non-diagonal x case. 
Example 1 In the case of d = 1, the matrix P^ is 

/ 2r 2 Xl + l yi Vi \ 

1 2r 2 ai 2 + l y 2 y 2 

r r 2 yi 2r 2 xi +2 1 

V r 2 y 2 1 2r 2 x 2 + 2 / 

We note that the maximal eigenvalue of P^ is 0(r). Our implementation 
of the HGM solves numerically the ordinary differential equation 

^ = (P« - r\E)G (36) 
or 

instead of solving d r F{Z) = Pv>F(Z), where A is the maximal eigenvalue 
of lim r _j. +00 p( r ' jr and the vector valued function G is defined by F(Z) = 
exp(Ar 2 /2)G. This scalar scaling is necessary, because the adaptive Runge- 
Kutta method requires an absolute error bound of the solution to automatically 
make meshes finer and when a solution grows exponentially, meshes become too 
small to keep the absolute error bound. 

Let us discuss on the accuracy of the HGM. The truncation error of the series 
approximation is estimated in Theorem [2] We want to estimate the numerical 
error of the normalizing constant by applying the HGM. In other words, we want 
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to estimate how much the truncation error is magnified by solving the ordinary 
differential equation numerically. We use a practical method to do this. In this 
method, we suppose that initial values are governed by a probability measure. 
This assumption is natural in, e.g., the molecular modeling and some classes 
of non-linear ordinary differential equations are studied under this assumption 
(see, e.g., [5]). In our case, the equation is linear and the problem is much 
easier. Since we do not find a relevant reference fit to our case, we include 
below a discussion on the behavior of solutions of linear equation under random 
initial data. For a given initial value vector Zq at r = r$ evaluated by series, we 
denote by RZ$ the output obtained at r = r\ by solving the ordinary differential 
equation where R is a constant matrix, because the Runge-Kutta solver can be 
regarded as a linear map from the input to the output. (Here, we assume that 
round-off errors and cancellation errors by floating point number arithmetic are 
sufficiently small.) When Zq is regarded as a random vector distributed as 
a multivariate normal distribution, the output RZq is also a random variable 
distributed as a multivariate normal distribution. In our implementation, we 
perturb Zq with random numbers of which standard deviation is e/2 where s is a 
truncation error and solve the ordinary differential equation for these perturbed 
initial values. We evaluate the mean and the standard deviation of the first 
component of RZq and these give an evaluation of the normalizing constant 
and its statistical error bound. This bound is more practical than that by the 
interval arithmetic. 

For example, the Figure [1] shows a histogram of the normalizing constant 
which is generated by the above procedure with e/2 = 0.1, ro = 1, r\ = ^ \%i\ + 
J2yf, d = 3, x = diag(1.2,2.5,3.2,3.6), y = (2.3,5.3,4.2,0.1). The series 
is evaluated at x/r\,y/r\,r = 1 and extended to r = n. In this case, the 
standard derivation of the normalizing constant is evaluated as 156.6288 and 
the confidence interval with probability 0.95 is [14065.6, 14679.6] 

Kume-Wood [7] gave a saddle point approximation of the normalizing con- 
stant for the Fisher-Bingham distribution. Our method evaluates the normal- 
izing constant with an error bound. The Table Q] shows values by the HGM 
and by the 3rd order approximation of the saddle point method by Kume and 
Wood. Here, d = 4 and (x^- ) = diag(xn, 2xn, 3xn, 4xn, 5xn), 0.5 < x\\ < 10, 
(Vk) = (0.5yo,0.4yo,0.3yo,0.2yo,0.1yo), Uo = 3. The absolute error bound to 
solve (|31)|) by the adaptive Runge-Kutta method is set to 10 -6 X^=i~ 2 G i( /(2e?+2) 
and £ is 10" 5 . The values of the standard deviation imply that the values by 
the HGM have at least 6 digit accuracy with 95 % confidence. 

5 Algorithm and Numerical Results 

In [TD1 Algorithm 1 and Theorem 2], we give an algorithm to obtain the MLE 
for the Fisher-Bingham distribution. This algorithm works in a general dimen- 
sion, but on current computers, it does not work for more than two dimensions 
because of a high computational complexity of the Grobner basis computation. 
We replace the part of the Grobner basis computation (steps 1,2,3 in [TQl Algo- 
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Figure 1: Histogram of normalizing constants by the HGM with randomized 
initial values 

rithm 1]) with our derivation of the Pfafnan system given in the Theorem [1] and 
the part of the numerical integration of ([2]) with the evaluation by the series ((2]) 
and extend values to slowly convergent domains of the series by the HGM. For 
efficiency, we calculate the inverse matrices in our expressions of Hij and Hi nu- 
merically during the steps of the adaptive Rungc-Kutta method. This leads us 
solving the maximum likelihood estimate problem in more than two dimensions 
case with the HGD. More precisely, we have the following complexity result. 

Theorem 4 The complexity of the series expansion method, the HGM and the 
HGD for the Fisher- Bingham distribution on the d-dimensional sphere is 

0((2d + 2) N+1 /N\) + (complexity of solving the ODE with respect to r) 
+0((2d+ 2) 3 ) x (steps of the convergence of gradient descent). 

The first and the second terms are the complexity to evaluate the initial values 
F(Z) up to the degree N and the third term is the complexity of the HGD. 

Proof. The number of terms of the truncated series of (|29[) is (24+2) = 
C d+ N +N ) = 0((2d + 2) N /Nl). The coefficients of the series can be evaluated by 
recursive relation. We need 2d + 1 derivatives of Z. Thus, we obtain the first 
term. 

Our HGD requires to compute inverse matrices of (2d+2) x (2d+2) matrices 
on each step of the HGD. Then, we obtain the third term. Q.E.D. 
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HGM 


Kumc-Wood, 3 


HGM 


standard deviation 


0.5 


189.243 


1.737976c-04 


189.763 


1.0 


985.529 


9.102497e-04 


994.043 


1.5 


5856.78 


5.424156e-03 


5808.16 


2.0 


39075.8 


3.624707c-02 


37602.6 


2.5 


287231 


2.667160c-01 


271557 


3.0 


2.28420c+06 


2.122623c+00 


2.15158c+06 


3.5 


1.93448c+07 


1.798630c+01 


1.82924c+07 


4.0 


1.72236c+08 


1.602082c+02 


1.63939c+08 


4.5 


1.59584c+09 


1.484901c+03 


L52931c+09 


5.0 


1.52663c+10 


1.420891c+04 


1.4717c+10 


5.5 


1.49868c+ll 


1.395204c+05 


L45179c+ll 


6.0 


1.50274c+12 


1.399244e+06 


1.46123c+12 


6.5 


1.53345c+13 


1.428082e+07 


1.49556c+13 


7.0 


1.58797c+14 


1.479060c+08 


L55222c+14 


7.5 


1.66504c+15 


1.551038c+09 


L6302c+15 


8.0 


1.76459c+16 


1.643961c+10 


1.7299c+16 


8.5 


1.88748c+17 


1.758618c+ll 


L85223c+17 


9.0 


2.03531c+18 


1.896519c+12 


1.99905c+18 


9.5 


2.21040c+19 


2.059834c+13 


2.1716c+19 


10.0 


2.41579c+20 


2.251392c+14 


2.37462c+20 



Tabic 1: Normalizing constants 



We implemented our algorithm firstly on Maple and next on the language C 
and the GNU scientific library [3]. The prototype in Maple is useful to debug 
our C codes. Our C code is automatically generated by our code generation 
program pf n_gen_c_2 . rr on Risa/Asir |12j . We present some examples to 
illustrate the performance of our new algorithm and new implementations. 

Example 2 The problem "Astronomical data" given in [10] is solved in 2.58 
seconds on a 32 bit virtual machine, of which host machine is running with the 
Intel Xeon E5410 (2.33GHz) processor. 

The following timing data is taken on the same machine. Figures in the following 
tables are given in seconds. 

Example 3 We solve 2 problems on the 3 dimensional sphere. Input data 
are generated by a random number generator, which uses the Neumann rejec- 
tion method, for the Fisher-Bingham distribution with the parameter (x, y) = 
(X, Y) . The starting point of the HGD is (X, Y) itself. Our implementation, 
input data of this example and the following examples are obtainable from 
http : / / www . math . kobe-u . ac . jp/0penXM/Math/Fisher-Bingham-2, The tim- 
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ing data is summarized as follows. The programs use the Fletcher-Reeves con- 
jugate gradient algorithm (FR method) to find a local maximum. The iteration 
terminates when the square norm of the gradient is less than 0.001 or the number 
of iterations reaches to 1000. 



Problem 


Time (initial value) 


Time (HGD) and steps 


Total 


s3-el 


0.016 


5.6 (78 steps) 


5.6 


s3-e3 


0.028 


3.8 (53 steps) 


3.8 



(C programs (ko-fbd.c and s3_el.c)) 



Example 4 3 problems on the 4 dimensional sphere. 



Problem 


Time (initial value) 


Time (HGD) and steps 


Total 


s4-el 


0.128 


24 (99 steps) 


24 


s4-e2 


0.144 


31 (121 steps) 


31 


s4-e3 


0.124 


20 (72 steps) 


20 



(C programs (ko-fbd.c and s4_el.c)) 



Example 5 3 problems on the 5 dimensional sphere. 



Problem 


Time (initial value) 


Time (HGD) and steps 


Total 


s5-el 


0.624 


104 (147 steps) 


105 


s5-e2 


0.652 


75 (98 steps) 


76 


s5-e3 


0.672 


82 (107 steps) 


83 



(C programs (ko-fbd.c and s5_el.c)) 



Problems in the dimensions 6 and 7. 



Problem 


Time (initial value) 


Time (HGD) and steps 


Total 


s6-el 


2.9 


291 (132 steps) 


294 


s7-cl 


13 


973 (151 steps) 


986 



(C programs (ko-fbd.c and s6_el.c, s7_el.c)) 



6 Conclusion and Open Problems 

We show that the HGD can solve some MLE problems up to dimension d = 7 
by utilizing an explicit expression of the the Pfaffian system and the series 
expansion of the normalizing constant. However, there are two problems in 
applying our method to arbitrary data. 

1. In examples, we can choose a starting point to search the MLE by the HGD 
sufficiently close to the the MLE, because these examples are generated by 
a random number generator. Finding a good starting point for arbitrary 
data is an open question. 

2. There are domains in the (x, j/)-space where the normalizing constant can- 
not be evaluated with a given accuracy in a reasonable time, because the 
normalizing constant is a huge number in these domains. 
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Although, there still remain important open problems, our new method gives a 
method to evaluate normalizing constant and its derivatives with accuracy for 
sufficiently large set of parameters and solves MLE problems when a starting 
point, which is sufficiently close to the answer, is given. 
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